Wines Exploration by José Gildo

This work compares the red and white wine datasets. Both are among the dataset options provided for this project.

The main question that we will try to answer is:

This report explores a dataset of red and white wines from several perspectives. The red wine dataset has information about 1,599 wines and the white wine dataset has information about 4,898 wines. Combined, the two datasets have 6,497 rows and 13 variables.

Univariate Plots Section

Red Wines:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median : 2.200   Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
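The table above has the layout of R's `summary()` output. As a rough illustration of how each cell is computed, here is a minimal Python sketch (standard library only). It assumes the quartiles use linear interpolation between order statistics, which is R's default rule; the function name `summarize` is hypothetical.

```python
import statistics

def summarize(values):
    """Return the six statistics shown per column in the summary table."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between order statistics,
    # which matches R's default quantile rule (type 7).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "Min.": data[0],
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(data),
        "3rd Qu.": q3,
        "Max.": data[-1],
    }

# The X column of the red wine table is simply the row index 1..1599.
row_index = list(range(1, 1600))
stats = summarize(row_index)
print(stats)
```

Running it on the row index reproduces the X column of the red wine table (1st Qu. 400.5, median 800, 3rd Qu. 1199.5).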

White Wines:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median : 2.200   Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Comparison of both classes of wine for each attribute:

Histograms of all variables in the dataset:

Median values of all attributes for red and white wines:

Comparison of the best and worst red wines:

Comparison of the best and worst white wines:

Univariate Analysis

What is the structure of your dataset?

The red wine dataset has information about 1,599 wines and the white wine dataset has information about 4,898 wines. Combined, the two datasets have 6,497 rows and 13 variables.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality. We would like to determine the smallest combination of features that best determines the quality of a wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features that would help the analysis of both wines, although they are not in the dataset, are the age of the wine, the grape variety, the price of the bottle, the region, and whether the wine is a blend. For red wines there are visible differences between high and low quality wines: alcohol, citric.acid and volatile.acidity are (apparently) inversely proportional. White wines, however, show a remarkable difference in the alcohol attribute and subtle differences in pH and density.

Did you create any new variables from existing variables in the dataset?

No new variables were derived. I did create a new dataset by joining the red and white wine datasets.
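The join can be pictured as a simple concatenation in which each row remembers which dataset it came from. The following Python sketch is only an illustration under that assumption; the `type` column name and the placeholder rows are hypothetical and do not come from the original analysis.

```python
def join_wines(red_rows, white_rows):
    """Concatenate two lists of row dicts, tagging each row with its wine type."""
    combined = []
    for label, rows in (("red", red_rows), ("white", white_rows)):
        for row in rows:
            tagged = dict(row)       # copy so the input rows are not mutated
            tagged["type"] = label   # hypothetical column name
            combined.append(tagged)
    return combined

# Placeholder rows standing in for the real data, using the row counts
# stated in the report (1,599 red and 4,898 white wines).
red = [{"quality": 5} for _ in range(1599)]
white = [{"quality": 6} for _ in range(4898)]
all_wines = join_wines(red, white)
print(len(all_wines))  # 6497
```

Keeping a type label on every row is what makes red versus white comparisons possible on the joined data.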

Of the features you investigated, were there any unusual distributions?

The datasets had to be reshaped so that the plotting libraries could build the graphs presented here.

Some observations:

  • Alcohol: In general, red and white wines have a similarly shaped distribution of alcohol content, but red wines have more alcohol than white wines. Interestingly, we found white wines with 14% alcohol and red wines with 8% alcohol;

  • pH: In general, red wines have a higher pH than white wines. Two considerations apply here: 1) pH is a logarithmic scale, so a difference of one pH unit corresponds to a tenfold difference in acidity, and even small differences on this scale are meaningful; 2) low pH values indicate an acidic environment, and the environment becomes more alkaline as pH increases. We can observe that pH and citric acid are inversely related, and our dataset confirms this. White wines are more acidic and red wines are more alcoholic;

Acidity:

  • Citric Acid: In general, white wines have more citric acid than red wines, which is natural given the grapes used in the process;

  • Volatile Acidity: In general, red wines have more volatile acidity than white wines;

  • Fixed Acidity: In general, red wines have more fixed acidity than white wines;

  • Chlorides: In general, red wines have more chlorides than white wines, probably because of the physical-chemical production process. Both wines show a lot of variability in this attribute;

  • Density: In general, red wines are denser than white wines. Density is an important factor when pairing wine with fat, which is why red wine is commonly served with fatty meats. This is an expected result. White wines are meant to be refreshing, and high density does not suit that purpose;

  • Sulphates: In general, red wines have more sulphates than white wines, but both wines show low variability in this variable.

Sulfur:

  • Total and Free Sulfur Dioxide: Sulfur dioxide is used to prevent oxidation and microbial growth. However, excessive amounts of SO2 can inhibit fermentation and cause undesirable sensory effects;

  • Residual Sugar: In general, red wines have almost no residual sugar. White wines have more residual sugar and more variability than red wines. The distribution of this variable appears to be skewed;

  • Quality: Even with different combinations of attributes, both wines arrive at similar quality.
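The logarithmic nature of the pH scale mentioned in the pH observation can be made concrete with a short calculation. This is a minimal Python sketch based on the standard definition pH = -log10([H+]); the function name and the pH values used are illustrative, not taken from the dataset.

```python
def acidity_ratio(ph_low, ph_high):
    """How many times higher the hydrogen ion concentration is at ph_low than at ph_high."""
    # pH = -log10([H+]), so [H+] = 10 ** (-pH) and the ratio of two
    # concentrations is 10 ** (ph_high - ph_low).
    return 10 ** (ph_high - ph_low)

# One full pH unit is a tenfold difference in hydrogen ion concentration.
print(acidity_ratio(3.0, 4.0))

# A gap of 0.3 pH units already means roughly twice the concentration.
print(round(acidity_ratio(3.0, 3.3), 2))
```

Seemingly small pH gaps between wines therefore correspond to sizeable differences in acidity.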

Conclusions:

  • White wines are more acidic and red wines are more alcoholic;
  • White wines have more homogeneous acidity attributes than red wines;
  • When we compare all wines, we notice differences in three variables: volatile acidity, pH and total sulfur dioxide. If we consider only quality in red wines, alcohol, citric.acid and volatile.acidity are (apparently) inversely proportional. For white wines, however, there is a remarkable difference in the alcohol attribute and subtle differences in pH and density;
  • It is possible that the elevated sulphate values in some red wines are related to preservation techniques used in artificial wine production.

Bivariate Plots Section

This graph shows several interesting things:

All Wines:

Red Wines:

We are interested in understanding how quality behaves against the other variables, considering only red wines.

White Wines:

Now we are interested in understanding how quality behaves against the other variables, considering only white wines.

Conclusions

All Wines:

The differences between the best and worst wines are subtle for both types of wine (red and white). The data suggest that more alcohol and citric acid, together with lower density and chlorides, are associated with maximum quality in both types of wine. This agrees with oenological theory, in which good wines have a good balance between alcohol, density and citric acid. Chlorides and sulphates may be substances added during the process to achieve this balance.

Red Wines:

If we look only at red wines, maximum quality is obtained when:

  • Alcohol > 10 (in most cases);
  • Density < 1;
  • Fixed acidity < 13;
  • Volatile acidity is low, between 0.2 and 0.9;
  • For Free Sulfur Dioxide, Total Sulfur, Fixed Sulfur and Residual Sugar, the best and worst wines have similar proportions, so these variables are unlikely to determine quality in red wines. At the same time, we found maximum quality with pH both below 3 and above 3.5. Perhaps pH is the result of the combination of other variables rather than a determining variable in its own right.

The best red wines have more alcohol, more citric acid, more sulphates, a lower pH, and lower density and chlorides. These attributes show the contrast between the best and worst red wines.

White Wines:

If we look only at white wines, maximum quality is obtained when:

  • The ranges of the variables are more sharply defined than for red wines;
  • Alcohol between 10 and 13;
  • pH between 3.15 and 3.5;
  • Volatile acidity between 0.2 and 0.4;
  • Fixed acidity between 6 and 9;
  • Chlorides < 0.05;
  • Density between 0.9 and 1;
  • Sulphates between 0.3 and 0.7.
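The ranges in the list above can be written down as a simple profile check. The Python sketch below is only an illustration: the ranges are copied from the list, the dictionary keys reuse the dataset's column names, and the function name and example wine are hypothetical. It describes where top-quality white wines were observed and is not meant as a quality classifier.

```python
# Ranges copied from the list above (exclusive bounds); None means "no bound".
WHITE_BEST_RANGES = {
    "alcohol": (10, 13),
    "pH": (3.15, 3.5),
    "volatile.acidity": (0.2, 0.4),
    "fixed.acidity": (6, 9),
    "chlorides": (None, 0.05),
    "density": (0.9, 1),
    "sulphates": (0.3, 0.7),
}

def in_best_white_profile(wine):
    """Return True if every listed attribute of `wine` falls inside its range."""
    for name, (low, high) in WHITE_BEST_RANGES.items():
        value = wine[name]
        if low is not None and not value > low:
            return False
        if high is not None and not value < high:
            return False
    return True

# Hypothetical wine, not a row from the dataset.
example = {
    "alcohol": 12.0, "pH": 3.3, "volatile.acidity": 0.3, "fixed.acidity": 7.0,
    "chlorides": 0.03, "density": 0.99, "sulphates": 0.5,
}
print(in_best_white_profile(example))
```

A wine can sit inside every range and still not be top quality, so a check like this is best used to filter rows for closer inspection.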

The best white wines have more alcohol, more citric acid, more sulphates, and lower density and chlorides than the worst ones. However, they have a higher pH than the worst white wines, whereas the best red wines have a lower pH than the worst red wines.

General Conclusions:

  • For both types of wine, the variables most related to quality are alcohol, density and citric acid. This agrees with oenological theory, in which good wines have a good balance between alcohol, density and citric acid;

  • The best red wines are more diverse: they can be found across quite different values of the variables. Perhaps because of different grapes, blends or recipes, red wines show that maximum quality can be achieved in many ways, even with less alcohol and more density. Maximum quality in white wines, however, is much more specific, which suggests that it follows a pattern, like a recipe. The margins between maximum and worst quality are small, with little variability in the variables.

Outlier Analysis: There is a white wine with maximum quality and a low percentage of alcohol. This is an interesting outlier. In this particular case, the low alcohol content is accompanied by higher values of residual sugar, fixed acidity and density, perhaps giving this wine the balance it needs.

Multivariate Plots Section

At this point we can see that alcohol and density are related to quality for all wines, but acidity is not. We also tried to relate the acidity variables to alcohol and density, without success. After that, we tried to build a function relating alcohol and density that explains quality. We built three models to explain quality:

  • 1) Geometric mean of alcohol and density;

\[ f(a,d) = \sqrt{a \cdot d} \]

  • 2) Geometric mean of alcohol, density and citric acid;

\[ f(a,d,c) = \sqrt[3]{a \cdot d \cdot c} \]

  • 3) Proportion between alcohol and density;

\[ f(a,d) = \frac{a}{d} \]
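The three formulas translate directly into code. The following is a minimal Python sketch, assuming `a`, `d` and `c` stand for the alcohol, density and citric.acid columns; the function names are hypothetical, and the input values are the red wine medians from the summary table.

```python
import math

def geometric_mean_2(alcohol, density):
    """Model 1: geometric mean of alcohol and density."""
    return math.sqrt(alcohol * density)

def geometric_mean_3(alcohol, density, citric_acid):
    """Model 2: geometric mean of alcohol, density and citric acid."""
    return (alcohol * density * citric_acid) ** (1 / 3)

def ratio(alcohol, density):
    """Model 3: proportion between alcohol and density."""
    return alcohol / density

# Median red wine values from the summary table.
a, d, c = 10.2, 0.9968, 0.26
print(geometric_mean_2(a, d))
print(geometric_mean_3(a, d, c))
print(ratio(a, d))
```

One property worth noting: because the second model multiplies its inputs, a wine with citric.acid equal to 0 (the minimum in the red wine summary) gets a score of 0 regardless of its alcohol and density.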

The first model treats quality as a balance between alcohol and density. The second model treats quality as a balance between alcohol, density and acidity (which is closer to reality). The third model treats quality as a proportion between alcohol and density. We then check each model's correlation with the quality values to measure its ability to explain the quality variable.
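The correlation check can be sketched as follows. This Python example assumes the Pearson coefficient (the default of R's `cor()`); the source does not say which coefficient was used. The score and quality values are toy numbers, not rows from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy model scores (for example alcohol / density) and quality ratings.
scores = [9.4, 9.8, 10.5, 11.2, 12.8]
quality = [5, 5, 6, 6, 7]

r = pearson(scores, quality)
print(round(r, 3))      # correlation between model score and quality
print(round(r * r, 3))  # share of variance explained by a linear fit
```

The correlation r and the explained variance r² are different quantities: a correlation of 0.45, for instance, corresponds to an r² of about 0.20.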



Multivariate Analysis

Final Plots and Summary


Reflection

We probably need more information about the wines, such as the year of production, the grapes used, the place of production (terroir) and other variables related to taste. We know that alcohol is able to explain about 45% of the quality variable. Perhaps by combining alcohol with other variables we can determine quality as well as a human evaluator does.